Before we can start mapping in R, we need to load the required libraries. We do this once at the beginning for all libraries used throughout this tutorial. Packages that are not yet installed can be installed with the install.packages() function, for example install.packages('tidyverse').
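If some of the packages are missing on your machine, you can first check which ones need installing. The following is a small helper sketch; the package names are taken from the library() calls below.

```r
# Packages used in this tutorial, as loaded with library() below.
packages <- c("HistData", "tidyverse", "lingtypology", "sf",
              "rworldmap", "maps", "classInt")
# Which of them are not yet installed on this machine?
missing <- packages[!packages %in% rownames(installed.packages())]
missing
# Any packages listed here can then be installed in one call:
# install.packages(missing)
```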
library(HistData)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.6 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lingtypology)
library(sf)
## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(rworldmap)
## Loading required package: sp
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
library(classInt)
We will demonstrate the core logic of ggplot2 with a famous non-linguistic dataset before proceeding to plotting map data. Francis Galton compared the heights of parents to the heights of their children. This dataset is built into the R package HistData; we can access it using the data() function as follows, and we can use the head() function to show the first six rows of this dataset.
data("Galton")
head(Galton)
## parent child
## 1 70.5 61.7
## 2 68.5 61.7
## 3 65.5 61.7
## 4 64.5 61.7
## 5 64.0 61.7
## 6 67.5 62.2
First, the Galton dataset is 'piped' to the ggplot() function using the pipe operator '%>%'. To specify points, we need their x- and y-values, so the aes() function instructs ggplot to draw them from the parent and child columns, respectively. The '+' symbol adds layers to the plot, in our case geom_point(), which uses the aesthetic mappings supplied in the initial ggplot() call.
Galton %>%
ggplot(mapping = aes(x = parent, y = child)) + geom_point()
To make the plot easier to read, we can add random jitter to the x- and y-values: one way of achieving this is by swapping geom_point() for geom_jitter(). We can additionally specify the argument 'alpha = 0.5', which controls the opacity of the points. A value of 0.5 makes each point 50% opaque, so regions where many points overlap appear darker. Finally, we add theme_classic(), a complete theme that reduces visual clutter such as the grey panel background ggplot2 uses by default.
Galton %>%
ggplot(mapping = aes(x = parent, y = child)) + geom_jitter(alpha = 0.5) + theme_classic()
Without much effort we can create an interactive map using the lingtypology package. Its map.feature() function can take a vector of language names, created below with the combine function c(). The resulting map shows one dot for each language.
map.feature(c("Finnish", "Karelian", "Swedish", "Estonian", "Danish", "North Saami"))
To create our first custom map within the ggplot2 framework, we start by creating an empty world map to which we can then add our linguistic data as a layer in a second step. To retrieve the data for our base layer, we use the map_data() function. We store this information in an R object named world.
world <- map_data("world")
head(world)
## long lat group order region subregion
## 1 -69.89912 12.45200 1 1 Aruba <NA>
## 2 -69.89571 12.42300 1 2 Aruba <NA>
## 3 -69.94219 12.43853 1 3 Aruba <NA>
## 4 -70.00415 12.50049 1 4 Aruba <NA>
## 5 -70.06612 12.54697 1 5 Aruba <NA>
## 6 -70.05088 12.59707 1 6 Aruba <NA>
With this data, we can produce a first map of the world. We provide the ggplot() function with the world data we created and specify in the aesthetics that longitude and latitude will serve as the x- and y-axis units. With the geom_map() function we instruct R to draw the world data.
world_plot <- ggplot(data = world, aes(x = long, y = lat)) + geom_map(map = world,
aes(map_id = region))
world_plot
Now that we have our base map, we can add linguistic data to it. We will use data from the World Atlas of Language Structures (Dryer & Haspelmath, 2013) via the wals.feature() function, which allows us to import features from the Atlas. Here, we import feature 81a, which encodes the order of subject, object and verb, and store it as word_order.
word_order <- wals.feature(c("81a"))
## Don't forget to cite a source (modify in case of using individual chapters):
##
## Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.
## (Available online at https://wals.info/, Accessed on 2023-01-11.)
##
## @book{wals,
## address = {Leipzig},
## editor = {Matthew S. Dryer and Martin Haspelmath},
## publisher = {Max Planck Institute for Evolutionary Anthropology},
## title = {WALS Online},
## url = {https://wals.info/},
## year = {2013}
## }
In the next step we add this information in another layer to our map.
word_order_plot <- world_plot + geom_point(data = word_order, aes(x = longitude,
y = latitude, color = `81a`))
word_order_plot
We will also style the map a little. We add a minimal theme and change some design aspects.
word_order_plot + theme_minimal() + theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5)) +
scale_color_brewer(palette = "Set2") + ggtitle("Word Order in World Languages") +
xlab("Longitude") + ylab("Latitude")
In the next part of this tutorial, we focus on visualizing a more specific region to show how to narrow down the area of the map. To start, we retrieve the geographical county information for the USA and store it in an object we call us_geo. The st_as_sf() function changes the way the map data is stored: instead of having each point of longitude and latitude as its own row, all location details for one county are stored in one row.
us_geo <- st_as_sf(maps::map(database = "county", plot = FALSE, fill = TRUE))
head(us_geo)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686
## Geodetic CRS: WGS 84
## ID geom
## 1 alabama,autauga MULTIPOLYGON (((-86.50517 3...
## 2 alabama,baldwin MULTIPOLYGON (((-87.93757 3...
## 3 alabama,barbour MULTIPOLYGON (((-85.42801 3...
## 4 alabama,bibb MULTIPOLYGON (((-87.02083 3...
## 5 alabama,blount MULTIPOLYGON (((-86.9578 33...
## 6 alabama,bullock MULTIPOLYGON (((-85.66866 3...
We can now easily map the counties of the US.
ggplot(data = us_geo) + geom_sf()
The next step is importing data to map to the counties. We do this using the read_csv() function, which in this case grabs the data from a URL using the url() function.
us_data <- read_csv(url("https://raw.githubusercontent.com/danaroemling/mapping-for-linguists/main/MAPPING_SWEARING.csv"))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## county = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
us_data
## # A tibble: 3,076 x 53
## county ass asshole bastard bitch bitched bitchy bloody bullshit cock
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 alaba… 1.52e6 49600 9538 9.62e5 6995 8903 5087 120184 15897
## 2 alaba… 1.25e6 54318 6578 8.07e5 2334 7851 14004 98452 7002
## 3 alaba… 2.26e6 29188 3243 9.60e5 3243 6486 3243 74591 3243
## 4 alaba… 1.45e6 14629 2926 1.01e6 0 8777 0 105328 0
## 5 alaba… 5.59e5 72969 4230 5.07e5 2115 5288 3173 101523 9518
## 6 alaba… 2.17e6 56605 0 1.18e6 0 8708 0 182878 8708
## 7 alaba… 2.64e6 38680 11282 1.81e6 6447 4835 3223 164390 9670
## 8 alaba… 1.60e6 38763 8012 9.18e5 2166 5197 4115 95500 8446
## 9 alaba… 1.88e6 34756 5902 1.44e6 1312 1967 20329 155419 7869
## 10 alaba… 3.80e5 37028 1683 2.73e5 1683 6732 5049 40394 1683
## # … with 3,066 more rows, and 43 more variables: crap <dbl>, crappy <dbl>,
## # cunt <dbl>, damn <dbl>, damnit <dbl>, damned <dbl>, darn <dbl>, dick <dbl>,
## # dickhead <dbl>, douche <dbl>, douchebag <dbl>, dumbass <dbl>, dyke <dbl>,
## # fag <dbl>, faggot <dbl>, fatass <dbl>, freaking <dbl>, friggin <dbl>,
## # fuck <dbl>, fucked <dbl>, fucker <dbl>, fuckery <dbl>, fucking <dbl>,
## # goddamn <dbl>, gosh <dbl>, hell <dbl>, hoe <dbl>, homo <dbl>,
## # jackass <dbl>, motherfucker <dbl>, motherfucking <dbl>, nigger <dbl>,
## # piss <dbl>, pissed <dbl>, pissy <dbl>, pussy <dbl>, shit <dbl>,
## # shittiest <dbl>, shitty <dbl>, slut <dbl>, slutty <dbl>, twat <dbl>,
## # whore <dbl>
We now create a new object us_geo_data by joining the two earlier objects: left_join() matches the ID column of us_geo to the county column of us_data, so that geolocation information and linguistic data are combined.
us_geo_data <- left_join(us_geo, us_data, by = c(ID = "county"))
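As an aside, the logic of a left join can be illustrated with base R's merge(): every row of the first table is kept, and rows without a match in the second table are filled with NA. A minimal toy sketch (invented example values, not the tutorial data):

```r
# Toy illustration of a left join: every row of 'geo' is kept,
# and rows without a match in 'counts' get NA.
geo    <- data.frame(ID = c("alabama,autauga", "alabama,baldwin", "alabama,barbour"))
counts <- data.frame(county = c("alabama,autauga", "alabama,barbour"),
                     fuck   = c(100, 200))
# merge() with all.x = TRUE is base R's equivalent of dplyr::left_join()
joined <- merge(geo, counts, by.x = "ID", by.y = "county", all.x = TRUE)
joined
```

Counties present in us_geo but absent from us_data would likewise end up with NA values and appear unfilled on the map.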
Finally, we plot the new object us_geo_data. Just like above, we pass our data to the ggplot() function and add geom_sf() since we're handling polygons. This time, we also pass aesthetics to geom_sf() to say that the counts for 'fuck' should be mapped to the fill of the counties.
ggplot(data = us_geo_data) + geom_sf(aes(fill = fuck))
To make our map more informative, we introduce class intervals for our data. First, we create the intervals.
f_quantiles <- classIntervals(us_geo_data$fuck, n = 5, style = "quantile")
f_quantiles
## style: quantile
## [0,889713) [889713,1243106) [1243106,1557093) [1557093,1934757)
## 615 615 615 615
## [1934757,9527509]
## 616
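The "quantile" style places the breaks so that each class holds roughly the same number of observations, which is why the five classes above contain 615 or 616 counties each. The underlying idea corresponds to base R's quantile() function; a minimal sketch with toy data (not the classInt implementation itself):

```r
# With style = "quantile", breaks sit at evenly spaced quantiles,
# so each of the n classes receives about the same number of values.
x <- 1:100
breaks <- quantile(x, probs = seq(0, 1, length.out = 6))  # 5 classes need 6 breaks
breaks
counts <- table(cut(x, breaks, include.lowest = TRUE))
counts  # each class holds 20 of the 100 values
```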
Then we assign each data point its corresponding interval.
us_geo_data <- mutate(us_geo_data, f_quantile = cut(fuck, f_quantiles$brks, include.lowest = TRUE,
dig.lab = 10))
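The cut() function turns each numeric value into a factor level naming the interval it falls into; dig.lab controls how many digits appear in those labels before R switches to scientific notation. A toy sketch using the breaks from the output above (the example values are invented):

```r
# cut() assigns each value to the interval it falls into.
vals <- c(500000, 1500000, 2000000)
brks <- c(0, 889713, 1243106, 1557093, 1934757, 9527509)
# With the default dig.lab = 3 the labels would use scientific notation;
# dig.lab = 10 keeps the full numbers readable.
labs <- cut(vals, brks, include.lowest = TRUE, dig.lab = 10)
labs
```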
In the next step we will use the class intervals we have introduced for mapping.
ggplot(data = us_geo_data) + geom_sf(aes(fill = f_quantile), lwd = 0.1, color = "grey") +
scale_fill_brewer(palette = "Purples")
In the last step we style the map.
ggplot(data = us_geo_data) + geom_sf(aes(fill = f_quantile), lwd = 0.1, color = "grey") +
scale_fill_brewer(palette = "Purples") + coord_sf(crs = "ESRI:102003") + ggtitle("'Fuck' Distribution in the US per County") +
xlab("Longitude") + ylab("Latitude") + labs(fill = "Intervals") + theme_minimal()
We again use read_csv() to create a tibble with cities, two dialect variants and their counts, the corresponding proportion of the first variant, and geolocation details. We can type gsa_data to check our new data.
gsa_data <- read_csv(url("https://raw.githubusercontent.com/danaroemling/mapping-for-linguists/main/MAPPING_DIALECT.csv"))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## City = col_character(),
## Token1 = col_character(),
## Count1 = col_double(),
## Token2 = col_character(),
## Count2 = col_double(),
## Proportion = col_double(),
## Long = col_double(),
## Lat = col_double()
## )
gsa_data
## # A tibble: 350 x 8
## City Token1 Count1 Token2 Count2 Proportion Long Lat
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aachen schau 117 guck 139 45.7 6.08 50.8
## 2 Aalen schau 12 guck 4 75 10.1 48.8
## 3 Abstatt schau 1 guck 1 50 9.29 49.1
## 4 Ahlen schau 3 guck 1 75 7.90 51.8
## 5 Albstadt schau 1 guck 1 50 9.02 48.2
## 6 Alfter schau 2 guck 1 66.7 7.01 50.7
## 7 Altenbeken schau 1 guck 1 50 8.95 51.8
## 8 Altenholz schau 2 guck 3 40 10.1 54.4
## 9 Ammerbuch schau 3 guck 1 75 8.97 48.6
## 10 Andernach schau 2 guck 1 66.7 7.41 50.4
## # … with 340 more rows
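Consistent with the first row of the output (117 'schau' vs. 139 'guck' in Aachen), the Proportion column appears to hold the percentage share of Token1 among both tokens combined. A quick check, assuming this interpretation:

```r
# Share of Token1 ('schau') among both variants, as a percentage.
count1 <- 117  # schau in Aachen
count2 <- 139  # guck in Aachen
proportion <- count1 / (count1 + count2) * 100
round(proportion, 1)  # matches the 45.7 shown for Aachen
```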
In order to map this data, we first need a base map of the GSA (the German-speaking area). Similar to the world data available in ggplot2, rworldmap provides coordinates we can use to create this map. We store them in the world_map object. We then define the GSA as Austria, Germany and Switzerland by combining them with c() in an object called GSA. The next code chunk retrieves the coordinates for the previously defined GSA and saves them in the object GSA_coord. If you want to adapt this code for your own maps, you only need to change the character vector assigned to GSA. For example, you could assign c("Finland", "Iceland", "Estonia", "Denmark"); the code chunk would then retrieve the coordinates for those countries instead. The code was adapted from this tutorial: https://egallic.fr/en/european-map-using-r/.
world_map <- getMap()
GSA <- c("Austria", "Germany", "Switzerland")
GSA_map <- which(world_map$NAME %in% GSA)
GSA_coord <- lapply(GSA_map, function(i) {
df <- data.frame(world_map@polygons[[i]]@Polygons[[1]]@coords)
df$region = as.character(world_map$NAME[i])
colnames(df) <- list("long", "lat", "region")
return(df)
})
GSA_coord <- do.call("rbind", GSA_coord)
After this bit we can use the coordinates to make the base layer of the GSA map.
ggplot(data = GSA_coord) + geom_polygon(aes(x = long, y = lat, group = region)) +
coord_fixed(1.3)
In the last step we combine the base layer with the data in gsa_data. In the geom_polygon() function we first pass the aesthetics, saying that longitude and latitude should be mapped and that the polygons should be grouped by region, which in this case corresponds to the national borders of the three countries. Then the data is added in the geom_point() layer, just as above: we pass the data argument and again aesthetics with longitude and latitude. We also map the proportion of our counts to the color argument, while the size argument reflects the combined count of both features.
ggplot(data = GSA_coord) + geom_polygon(aes(x = long, y = lat, group = region), color = "black",
size = 0.1, fill = "snow3") + coord_fixed(1.3) + geom_point(data = gsa_data,
aes(x = Long, y = Lat, color = Proportion, size = (Count1 + Count2))) + scale_color_gradient(low = "seagreen3",
high = "mediumpurple3") + guides(size = "none") + ggtitle("Schau vs Guck in the GSA") +
xlab("Longitude") + ylab("Latitude") + theme_minimal()